The following data analysis of nitrate concentration spikes in water after rainfall events is an extension of our main data analysis from a gauge in Illinois. Since our main analysis only includes precipitation data through 2020, we used a different approach to get an estimation of what happened in the spring of 2021 at the same location.
Nitrate data comes from the US Geological Survey. Precipitation data comes from NOAA’s Local Climatological Data.
Since there are not nitrate monitoring locations in every county, we focused on doing a case study on the location that has the most data. To find this location, we used the following criteria:
Monitors the parameter “99133,” or nitrates in milligrams per Liter
Location is still active today
Had at least one nitrate spike above the federal threshold of 10mg/L this year
Its data goes as far back as possible
First, we loaded the R libraries we will use, including dataRetrieval. This is a package that allows us to request USGS data from R Studio.
library(dataRetrieval)
library(tidyverse)
library(lubridate)
library(rvest)
library(leaflet)
library(leaflet.providers)
library(dygraphs)
library(sp)
library(rgeos)
library(xts)
library(htmlwidgets)
library(data.table)
library(DT)Then, we will run the following code chunk. It uses a dataRetrieval. function to request all the USGS monitoring locations in Illinois and then filters them by keeping the ones that are still active and that have nitrate data. This will give us the site number, coordinates and how long each location has been active.
today <- Sys.Date()
IL_site <- readNWISdata(stateCd= "IL", parameterCd="99133",
service="site", seriesCatalogOutput=TRUE) %>%
filter(site_tp_cd == "ST") %>%
filter(end_date == "2021-09-08") %>%
filter(parm_cd == "99133") %>%
distinct(site_no, .keep_all = TRUE) %>%
select(site_no, station_nm, lat = dec_lat_va,
long = dec_long_va, begin_date) # 12 obs of 5 varThen we made this function that does the following:
Pulls all the site numbers for the state and requests nitrate data from this year.
It aggregates the data to find the maximum nitrate reading, yearlyPeak.
It creates a new column that detects if the maximum nitrate level in each location was above 10 mg/L or not.
It removes sites that did not have a nitrate peak higher than the federal threshold this year.
It arranges the remainder locations from oldest to newest and selects the top one.
This means the function will chose one location in the state that has data that goes as far back as possible and that had a nitrate spike above federal guidelines.
state_function <- function(state){
site <- state
# nitrate levels from this year
iv <- readNWISdata(siteNumbers = site$site_no, parameterCd = "99133",
startDate = "2021-01-01", endDate = "2021-08-18",
service = "iv")
# now aggregate by yearly max
iv$year <- year(iv$dateTime)
yearPeak <- iv %>%
group_by(site_no, year) %>%
summarise(yearlyPeak = max(X_99133_00000, na.rm = T)) %>%
filter(yearlyPeak <= 40) %>% # wonky filter
select(-year)
# join these to "site", sort by date, pick oldest one above 10mg/L
site <- site %>%
right_join(yearPeak) %>%
arrange(begin_date) %>%
mutate(illegal = case_when(
yearlyPeak >= 10 ~ "Yes",
yearlyPeak < 10 ~ "No")) %>%
filter(illegal == "Yes") %>%
select(-illegal) %>%
head(1) # this is our top location at this state
return(site)
}
# The following site has the data we are looking for:
IL_site <- state_function(IL_site) # 03336890 (IL)
IL_simple <- IL_site %>% select(name = station_nm,
lat,
long,
begin_date)
IL_simple$type <- "USGS"This is the location we will analyze in this case study:
After scraping the links to all the NOAA weather stations in IL, we imported them to extract information about the stations.
# import csv I manually scraped
airport_links <- read_csv("Data/NOAA/airport_links_clean.csv")
# filter to IL
airport_links <- airport_links %>% filter(state == "IL") # 62
# now make a list with all the links
airport_list <- as.list(airport_links$link)The following function scrapes the coordinates and start and end date from each station.
# make a function that will scrape both tables and join them, uses list index
link_function <- function(index){
station <- read_html(airport_list[[index]])
table <- station %>% html_table()
first_t <- table[[1]]
second_t <- table[[2]]
colnames(second_t) <- colnames(first_t)
combined <- rbind(first_t, second_t)
return(combined)
}
# apply function to each index with a for loop
a1 <- link_function(1)
for (i in 2:62) {
name <- paste("a", i, sep = "")
assign(name, link_function(i) %>% select(-1,))
}After combining the individual tables, we cleaned up the format and filtered out the stations that are no longer active.
# transpose
all <- as.data.frame(t(all))
# fix header and row names
colnames(all) <- all[1, ]
all <- all %>% filter(Name != "Name")
rownames(all) <- NULL
all <- all %>% select(name = Name,
lat_long = `Latitude/Longitude`,
begin_date = `Start Date¹`,
end_date = `End Date¹`)
# clean lat_long columns
all <- all %>% separate(lat_long, c("lat", "long"), ",")
all$lat <- gsub("°", "", all$lat)
all$long <- gsub("°", "", all$long)
# then remove by end date, not active now
all <- all %>% filter(end_date == "2021-09-06") %>%
select(-end_date) # 58 obs of 4 var
all$type <- "NOAA"
rm(airport_links, airport_list, i, name, today)After joining these 58 active weather stations to the nitrate gauge in Illinois, we can map them out.
# join
locations <- rbind(IL_simple, all)
locations$lat <- as.double(locations$lat)
locations$long <- as.double(locations$long)Since we have the coordinates for all those points, we can use the following chunk to find the weather station closest to the gauge, or its “neighbor.”
sp.locations <- locations
coordinates(sp.locations) <- ~long+lat
d <- gDistance(sp.locations, byid=T)
min.d <- apply(d, 1, function(x) order(x, decreasing=F)[2])
neighbors <- cbind(locations, locations[min.d,], apply(d, 1, function(x) sort(x, decreasing=F)[2]))
colnames(neighbors) <- c(colnames(locations),"neighbor", "lat2", "long2", "begin_date2", "type2", "apply")
neighbors <- filter(neighbors, type == "USGS") %>%
select(-apply)
rm(d, sp.locations, min.d, all, IL_simple)Now that we have the weather station closest to the gauge, we manually requested precipitation data for that location and then imported the file to R.
We then made a function to clean the format and filter precipitation to the spring of 2021, defined as April through June.
rantoul_IL <- fread("Data/NOAA/rantoul.csv") # 167,763 obs of 124 var
# make a function that will clean up NOAA data
# into the format we need
precipitation_function <- function(airport){
# remove some columns
airport <- airport %>% select(datetime = DATE,
precip_hour = HourlyPrecipitation)
airport$date <- date(airport$datetime)
# make NA into 0s
airport <- airport %>% mutate(precip_hour = replace_na(precip_hour, 0))
# aggregate to total daily precip
airport <- airport %>% group_by(date) %>%
summarise(total_precip = sum(precip_hour))
return(airport)
}
# now run the function in IL
rantoul_IL <- precipitation_function(rantoul_IL) # 2343 obs of 2 var
# now filter to spring this year
rantoul_IL$year <- year(rantoul_IL$date)
rantoul_IL$month <- month(rantoul_IL$date)
rantoul_IL <- rantoul_IL %>% filter(year == "2021") %>% filter(month %in% c(4:6)) %>%
select(-year) %>% select(-month) # 91 obs of 2 varOnce we had the precipitation data ready, we did a request for nitrate concentration in the gauge over the same period of time. Since this data is continuous (measured every 15 minutes), we aggregated the data to daily maximum nitrate levels to compare it to rainfall on the same granular level.
# request USGS from 2021-04-01 until 2021-06-29
usgs_IL <- readNWISdata(siteNumbers = "03336890", parameterCd = "99133",
startDate = "2021-04-01", endDate = "2021-06-29",
service = "iv")
# clean columns
usgs_IL <- usgs_IL %>% select(site_no,
datetime = dateTime,
nitrate_lv = X_99133_00000)
usgs_IL$date <- date(usgs_IL$datetime) # 8,639 obs of 4 var
# aggregate to peak nitrate levels per day
usgs_IL_peak <- usgs_IL %>% group_by(date) %>%
summarise(daily_peak = max(nitrate_lv, na.rm = T)) # 91 obs of 2 var
rm(usgs_IL)With both variables ready, we converted the tables to XTS format to create an interactive visualization.
# create the xts (format different than DF) necessary to use dygraph
nitrates_IL <- xts(x = usgs_IL_peak$daily_peak, order.by = usgs_IL_peak$date)
precipitation_IL <- xts(x = rantoul_IL$total_precip, order.by = rantoul_IL$date)
IL_xts <- merge(nitrates_IL, precipitation_IL, join = 'left')
rm(nitrates_IL, precipitation_IL, usgs_IL_peak, locations, neighbors)
# plot
IL <- dygraph(IL_xts, main = "Nitrates and precipitation in IL location, spring 2021") %>%
dyAxis("y", label = "Daily Peak Nitrates (mg/L)") %>%
dyAxis("y2", label = "Total Daily Precipitation (in)", independentTicks = TRUE) %>%
dySeries("precipitation_IL", axis = 'y2') %>%
dyRangeSelector(height = 20) %>%
dyHighlight(highlightCircleSize = 5) %>%
dyShading(from = 10, axis = "y",to = 50, color = "#FFE6E6") %>%
dyOptions(colors = c("#a97de3", "#6fc978"), drawGrid = F) %>%
dyLegend(labelsSeparateLines = T)In order to quantify the nitrate spikes and how those coincide with precipitation, we had to define the events through code. But first, we downloaded USGS nitrate concentration data for our site in IL. This time, we made a function to filter it by season, only keeping the spring months, defined as April through June.
We decided to analyze the spring data since that is when most farmers apply fertilizer and when climate assessments, such as the IL climate assessment, predict precipitation could keep increasing.
After filtering the spring data, this function aggregates that data into daily nitrate peaks, as earlier in the analysis.
# request IL nitrate data for this spring this year, April-June 2021
# IL: 03336890
spring_IL <- readNWISdata(siteNumber = "03336890", parameterCd = "99133",
startDate = "2021-04-01", endDate = "2021-06-29",
service = "iv")
spring_function <- function(spring_state){
# filter by month
spring_state$month <- month(spring_state$dateTime)
spring_state <- spring_state %>% filter(month %in% c(4:6))
# clean columns
spring_state <- spring_state %>% select(site_no,
datetime = dateTime,
nitrate_lv = X_99133_00000)
spring_state$date <- date(spring_state$datetime)
# aggregate by daily peak
spring_state <- spring_state %>% group_by(date) %>%
summarise(daily_peak = max(nitrate_lv, na.rm = T))
return(spring_state)
}
# apply function to IL
spring_IL <- spring_function(spring_IL) # 91 obsThis time we also incorporated nutrient load data. While concentration might be really important on a local health case aspect, given the federal limit on drinking water, the amount of nutrient loading is also relevant on a larger scale, such as its impact in the Gulf of Mexico’s hypoxia zone.
In Illinois, USGS stores load data as the parameter “91049.” This data has pounds of nitrate per day measured every 15 minutes. We aggregated these values as the average daily load to use it on the same granular level as our other variables.
Note: load data is not available throughout our entire date range. Some spike events will not include this information.
load_IL <- readNWISdata(siteNumbers = "03336890", parameterCd = "91049",
startDate = "2021-04-01", endDate = "2021-06-29",
service = "iv")
# make function to clean and aggregate to average load per day
load_function <- function(load_state){
# select and rename columns
load_state <- load_state %>% select(dateTime, load = X_91049_00000)
# aggregate to average load per day
load_state$date <- date(load_state$dateTime)
load_state <- load_state %>%
group_by(date) %>%
summarise(average_daily_load = mean(load))
# filter to same spring months
load_state$month <- month(load_state$date)
load_state <- load_state %>% filter(month %in% c(4:6)) %>% select(-month)
return(load_state)
}
# run function in IL
load_IL <- load_function(load_IL) # 91 obs
# combine them to rantoul precip data
spikes_IL <- left_join(spring_IL, rantoul_IL) %>%
left_join(load_IL) # 91 obs
rm(spring_IL, rantoul_IL, load_IL, IL_site, IL_xts)Once spring data sets (nitrate concentration, loading and precipitation) are merged, we can set parameters to define a nitrate spike event.
This event will be true when a day has a nitrate level above 10mg/L but that the day before the levels were below 10mg/L. We will start by using a lag function to have the previous days values in a new column and then detect a nitrate spike based on the criteria we defined.
spikes_IL <- spikes_IL %>%
mutate(nitrates_24b = lag(daily_peak, 1)) %>%
mutate(nitrate_spike = case_when(
(daily_peak >= 10 & nitrates_24b < 10) ~ TRUE
))Now we want to see how much it rained the day of the spike plus the day before, in case a rain event went overnight. We are using a lag function to aggregate those two days of rain into “pre_spike_rain.”
spikes_IL <- spikes_IL %>%
mutate(precipitation_24b = lag(total_precip, 1)) %>%
mutate(pre_spike_rain = total_precip + precipitation_24b)Then, we did a similar aggregation for the nitrate load. However, instead of adding the load from the day before when it started raining, we added the load from the day after the nitrate spike. We did this because in this watershed, there seems to be a bigger lag from when the rain starts to when the gauge records an increased load.
Several factors can affect this, such as the size of the watershed and the percentage of crop land in the watershed.
spikes_IL <- spikes_IL %>%
mutate(load_24a = lead(average_daily_load, 1)) %>%
mutate(post_spike_load = average_daily_load + load_24a)In the spring of 2021, we found 91 spike events.
# filter for TRUE spikes
spikes_IL <- spikes_IL %>% filter(nitrate_spike == T) %>%
select(date, nitrate_spike, daily_peak, pre_spike_rain, post_spike_load) # 42 obs
# and now filter for any spikes that had any amount of rain (just not 0)
spikes_IL <- spikes_IL %>% mutate(rained = case_when(
pre_spike_rain != 0 ~ "Yes",
pre_spike_rain == 0 ~ "No"
))
spikes_IL$rained <- as.factor(spikes_IL$rained)
rainy_spikes_IL <- spikes_IL %>% filter(rained == "Yes") # 4 obsWe also found that 4 out of those 4 spikes, or 100%, had any measure of rain in that day or the day before.
Here are more stats about those spikes:
## date nitrate_spike daily_peak pre_spike_rain
## Min. :2021-05-09 Mode:logical Min. :10.70 Min. :0.560
## 1st Qu.:2021-05-24 TRUE:4 1st Qu.:11.15 1st Qu.:1.288
## Median :2021-06-08 Median :11.50 Median :1.885
## Mean :2021-06-05 Mean :12.68 Mean :2.862
## 3rd Qu.:2021-06-20 3rd Qu.:13.03 3rd Qu.:3.460
## Max. :2021-06-25 Max. :17.00 Max. :7.120
## post_spike_load rained
## Min. : 4922 Yes:4
## 1st Qu.: 8156
## Median :17279
## Mean :26178
## 3rd Qu.:35301
## Max. :65233